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Abstract 



Benchmarking Declarative Approximate Selection Predicates 

Oktie Hassanzadeh 
Master of Science 
Graduate Department of Computer Science 
University of Toronto 
2007 

Declarative data quality has been an active research topic. The fundamental principle 
behind a declarative approach to data quality is the use of declarative statements to 
realize data quality primitives on top of any relational data source. A primary advantage 
of such an approach is the ease of use and integration with existing applications. 

Over the last couple of years several similarity predicates have been proposed for 
common quality primitives (approximate selections, joins, etc.) and have been fully ex- 
pressed using declarative SQL statements. In this thesis, new similarity predicates are 
proposed along with their declarative realization, based on notions of probabilistic infor- 
mation retrieval. Then, full declarative specifications of previously proposed similarity 
predicates in the literature are presented, grouped into classes according to their primary 
characteristics. Finally, a thorough performance and accuracy study comparing a large 
number of similarity predicates for data cleaning operations is performed. 
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Chapter 1 
Introduction 



The importance of data cleaning and quality technologies for business practices is well 
recognized. Data cleaning has been an active research topic in several communities 
including statistics, machine learning and data management. The quality of data suffers 
from typing mistakes, lack of standards for recording database fields, integrity constraints 
that are not enforced, inconsistent data mappings, etc. For years, data quality technology 
has grown independently from core data management. Data quality tools became part 
of Extract Transform Load (ETL) technologies, commonly applied during the initial 
loading phase of data into a warehouse. Although this might be a viable approach for 
data analytics, where data processed are static, it is far from acceptable for operational 
databases. Dynamic databases however, face proliferating quality problems, that degrade 
common business practices. 

Recently, there has been a major focus on tighter integration of data quality tech- 
nology with database technology. In particular there has been research work on the 
efficient realization of popular data cleaning algorithms inside database engines as well 
as studies for the efficient realization of data quality primitives in a declarative way. The 
approaches are complementary, the former assuring great performance and the latter 
ease of deployment and integration with existing applications without modification of 
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the underlying database engine. We are concerned with declarative implementations of 
data quality primitives in this thesis. In particular we study declarative realizations of 
several similarity predicates for the popular approximate (flexible) selection operation for 
data de-duplication [201 EH]- A similarity predicate sim() is a predicate that numerically 
quantifies the 'similarity' or 'closeness' of two (string) tuples. Given a relation R, the 
approximate selection operation using similarity predicate sim(), will report all tuples 
t G R such that sim(t q , t) > 6, where 9 a specified numerical 'similarity threshold' and t q 
a query tuple. Approximate selections are special cases of the approximate join (record 
linkage, similarity join) operation [201 HE]. Several efficient declarative implementations 
of this operation for specific similarity predicates have been proposed [20, HE] both for 
approximate selections and joins. 

In this thesis, we conduct a thorough study of declarative realizations of similarity 
predicates for approximate selections. We introduce and adapt novel predicates, realize 
them declaratively and compare them with existing ones for accuracy and performance. 
In particular we make the following contributions: 

• Inspired by the success of tf-idf cosine similarity from information retrieval [26] as 
a similarity predicate for approximate selections, we introduce declarative realiza- 
tions of other successful predicates from information retrieval and in particular the 
popular BM25 measure. 

• We introduce declarative realizations of probabilistic similarity predicates inspired 
by Language Models from information retrieval [22] and Hidden Markov Models 
[2"T] . suitably adapted for the case of approximate selections. 

• We present declarative realizations of previously proposed similarity predicates for 
the approximate selection problem and we propose a categorization of all measures 
both previously proposed and new according to their characteristics. 

• We present a thorough experimental study comparing all similarity predicates for 
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accuracy and performance, under various types of quality problems in the under- 
lying data. 



Chapter 2 



Related Work 



Data quality has been an active research topic for many years. A collection of statistical 
techniques have been introduced initially for the record linkage problem [9J [8] . The bulk 
of early work on data quality was geared towards correcting problems in census files 
[29] . A number of similarity predicates were developed taking into account the specific 
application domain (i.e., census files) for assessing closeness between person names (e.g., 
Jaro, Jaro- Winkler [15], [29], etc). 

The work of Cohen [6] introduced the use of primitives from information retrieval 
(namely cosine similarity, utilizing tf-idf [26]) to identify flexible matches among database 
tuples. A performance/accuracy study conducted by Cohen et al., [7] demonstrated that 
such techniques outperform common predicates introduced for specific domains (e.g., 
Jaro, Jaro- Winkler, etc). 

Other techniques geared towards database tuples include the merge/purge technique 
[T4] . Several predicates to quantify approximate match between strings have been utilized 
for dealing with quality problems, including edit distance and its variants [15] . Hybrid 
predicates combining notions of edit distance and cosine similarity have also been in- 
troduced [HH]. Recently, [SI E] presented SSJOIN, a primitive operator for efficient set 
similarity joins. Utilizing ideas from [28], such an operator can be used for approxi- 
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mate matching based on a number of similarity functions, including hamming distance, 
edit-distance and Jaccard similarity. However, the choice of the similarity predicate in 
this approach is limited [2]. The bulk of the techniques and predicates however have 
been introduced without a declarative framework in mind. Thus, integrating them with 
applications utilizing databases in order to enable approximate selections is not very easy. 

Gravano et al. [ITJ [TO], introduced a declarative methodology for realizing approx- 
imate joins and selections for edit distance. Subsequently a declarative framework for 
realizing tf-idf cosine similarity was introduced [12], [161 OS E] • 

There has been a great deal of research in the information retrieval literature on 
weighting schemes beyond cosine similarity with tf-idf weighting. Recent IR research has 
shown BM25 to be the most effective among the known weighting schemes [25]. This 
weighting scheme models the distribution of within-document term frequency, document 
length and query term frequency very accurately. Moreover, in the information retrieval 
literature, language modeling has been a very active research topic as an alternate scheme 
to weight documents for their relevance to user queries. Starting with Ponte and Croft 
[22] language models for information retrieval have been widely studied. 

Hidden Markov Models (HMM) have been very successful in machine learning and 
they have been utilized for a variety of learning tasks such as named entity recognition 
and voice recognition [23]. They have also been utilized for information retrieval as well 
[21] . An experimental study on TREC data demonstrated that an extremely simple 
realization of HMM outperforms standard tf-idf for information retrieval [21]. Several 
researchers [M] have tried to formally reason about the relative goodness of information 
retrieval weighting schemes. 



Chapter 3 



Framework 



Let Q be a query string and D a string record from a base relation R = {Di : 1 < % < N}. 
We denote by Q, D the set of tokens in Q and D respectively. We refer to substrings 
of a string as tokens in a generic sense. Such tokens can be words or q-grams (sequence 
of q consecutive characters of a string) for example. For Q=db lab', <2={'db', 'lab'} 
for word-based tokenization and Q={'db ' ,'b 1',' la', 'lab'} for tokenization using 3- 
grams. We refer to tokens throughout the thesis when referring to words or q-grams. We 
make the choice specific (word or q-gram) for techniques we present, when is absolutely 
required. In certain cases, we may associate a weight with each token. Several weighting 
mechanisms exist. We present our techniques referring to weights of tokens, making 
the choice of the weighting scheme concrete when required. In chapter we realize our 
techniques for specific choice of tokens and specific weighting mechanisms. 

Our goal is to calculate a similarity score between Q and D using a similarity pred- 
icate. We group similarity predicates into five classes based on their characteristics, 
namely: 

• Overlap predicates: These are predicates that assess similarity based on the 
overlap of tokens in Q, T>. 

• Aggregate Weighted Predicates: Predicates that assess similarity by manipu- 
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lating weights (scores) assigned to elements of Q, T> 

• Language Modeling Predicates: Predicates that are based on probabilistic 
models imposed on elements of Q, V 

• Edit Based Predicates: Predicates based on a set of edit operations applied 
between Q and D. 

• Combination Predicates: Predicates combining features from the classes above. 

The classes were defined by studying the properties of previously proposed similarity 
predicates as well as ones newly proposed herein. The first four classes encompass predi- 
cates introduced previously in various contexts for data cleaning tasks, with the exception 
of BM25 which to the best of our knowledge is the first time that is deployed for data 
cleaning purposes. The Language Modeling class of predicates draws from work on in- 
formation retrieval and is introduced herein for data cleaning tasks. Within each class 
we discuss declarative realizations of predicates. 



3.1 Overlap Predicates 

Suppose Q is the set of tokens in the query string Q and T> is the set of tokens in the 
string tuple D. The Inters ectSize predicate [28] is simply the number of common tokens 
between Q and D, i.e.: 

sim intersect (Q,D) = \QnV\ (3.1) 

Jaccard similarity [28] is the fraction of tokens in Q and S that are present in both, 
namely: 



If we assign a weight w (t 



|Qnx>| 

sim Jaccard (Q, D) = jQypj ( 3 - 2 ) 
to each token t, we can define weighted versions of the above 



predicates. WeightedMatch [28] is the total weight of common tokens in Q and T>, i.e., 



1 Discussion of ways to assign such weights to tokens follows in subsequent chapters. 
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SteQnx> w (t)- Similarly, Weighted J accard is the sum of the weights of tokens in \Q D T>\ 
divided by the sum of the weights of tokens in \ Q U T>\. 

3.2 Aggregate Weighted Predicates 

The predicates in this class encompass predicates widely adopted from information re- 
trieval (IR). A basic task in IR is, given a query, identifying relevant documents to that 
query. In our context, we would like to identify the tuples in a relation that are similar 
to a query string. 

Given a query string Q and a string tuple D, the similarity score of Q and D in this 
class of predicates is of the form sim(Q, D) = J2t&Qnv w qi^y Q) w d(t, D), where w q (t, Q) is 
the query-based weight of the token t in string Q and Wd(t, D) is the tuple-based weight 
of the token t in string D. 

3.2.1 Tf-idf Cosine Similarity 

The tf-idf cosine similarity [25] between a query string Q and a string tuple D is defined 
as follows: 

sim cosine (Q,D) = 22 w q(t,Q) w d{t,D) (3.3) 

t&QfXD 

where w q (t,Q),Wd(t, D) are the normalized tf-idf weights [26]. The normalized tf-idf 
between a token t and a string S, w(t, S) is given by: 

w(t,S) = -= fi t '^ , w'(t,S)=tf(t,S).idf(t) 

V l^t'es w K t ' b > 

The idf term makes the weight of a token inversely proportional to its frequency in the 
database; the tf term makes it proportional to its frequency in S. Intuitively, this assigns 
low scores to frequent tokens and high scores to rare tokens. More discussion is available 
elsewhere [61 H2] • 
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3.2.2 BM25 Predicate 

The BM25 similarity score between a query string Q and a tuple D, is given as: 

sim B M25(Q,D) = w q (t,Q)w d (t,D) (3.4) 
teQnx> 

where 

w (to) - ( fc 3 + i)*t/(t,Q) 

, ™ m, ^Oi + 1) *tf(t,D) 

i// 1 ' is a modified form of Robertson-Sparck Jones weight: 



K(D) = h (1-6) +6 



I-DI 



avgdl J 

and is the number of tuples in the base relation R, n t is the number of tuples in 
R containing the token t, tf(t, D) is the frequency of occurrence of the token t within 
tuple D, \D\ is the number of tokens of tuple D, avgdl is the average number of tokens 
per tuple, i.e. ^ De j^ ^ and k±, k 3 , and b are independent parameters. For TREC-4 
experiments [25], fc x G [1,2], k 3 = 8 and b e [0.6,0.75]. 



3.3 Language Modeling Predicates 

A language model, is a form of a probabilistic model. To realize things concretely, we 
base our discussion on a specific model introduced by Ponte and Croft [22]. Given a 
collection of documents, a language model is inferred for each; then the probability of 
generating a given query according to each of these models is estimated and documents 
are ranked according to these probabilities. Considering an approximate selection query, 
each tuple in the database is considered as a document; a model is inferred for each tuple 
and the probability of generating the query given the model is the similarity between the 
query and the tuple. 
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3.3.1 Language Modeling 

The similarity score between query Q and tuple D is defined as: 

sim LM {Q,D) = p(Q\M D ) = ]Jp(t\M D ) x JJ(1 - p(t\M D )) (3.6) 

where p(t\M£,) is the probability of token t occurring in tuple D and is given as follows: 

p ml (t, _D)( 1 -o-At,o) x p (t)4. c if tf > o 
p(t\M D ) = { (3.7) 

— otherwise 

cs 

p m i{t, D) is the maximum likelihood estimate of the probability of the token t under the 
token distribution for tuple D and is equal to where i/(t,£>) is raw term frequency 

and din is the total number of tokens in tuple D. p aV g{t) is the mean probability of token 
t in documents containing it, i.e., 

Pav 9 (t) = (3.8) 

where df t is the document frequency of token t. This term is used since we only have a 
tuple sized sample from the distribution of M^, thus the maximum likelihood estimate 
is not reliable enough; we need an estimate from a larger amount of data. The term R t ^ 
is used to model the risk for a term t in a document D using a geometric distribution: 

**- forks) x (infer (3 ' 9) 

ftD is the expected term count for token t in tuple D if the token occurred at the 
average rate, i.e., p avg {t) x dip. The intuition behind this formula is that as the tf gets 
further away from the normalized mean, the mean probability becomes riskier to use as 
an estimate. Finally, eft is the raw count of token t in the collection, i.e. Y1d<=r ^/(^> D) 
and cs is the raw collection size or the total number of tokens in the collection, i.e. 
^2 DeR dlD- is used as the probability of observing a non-occurring token. 
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3.3.2 Hidden Markov Models 

The query generation process can be modeled by a discrete Hidden Markov process. 
Figure 13.11 shows a simple yet powerful two-state HMM for this process. The first state, 
labeled "String" represents the choice of a token directly from the string. The second 
state, labeled "General English" represents the choice of a token that is unrelated to the 
string, but occurs commonly in queries. 

Suppose Q is the query string and D is a string tuple from the base relation R] 
the similarity score between Q and D, sirriHMM{Q, D), is equal to the probability of 
generating Q given that D is similar, that is: 

P(Q\D is similar) = Y[(a P(q\GE) + ai P(q\D)) (3.10) 

qeQ 

where: 

P( \D) num ^ er °f times q appears in D (3 11) 

length of D 

V nc p number of times q appears in D , 
P(qGE) = **** 3.12 

and ao and a>\ = 1 — ao are transition probabilities of the HMM. The values for these 
parameters can be optimized to maximize accuracy given training data. 



3.4 Edit-based Predicates 

An important and widely used class of string matching predicates is the class of edit- 
based predicates. In this class, the similarity between Q and D is the transformation 
cost of string Q to D, tc(Q,D). More specifically tc(Q,D) is defined as the minimum 
cost sequence of edit operations that converts Q to D. Edit operations include copy, 
insert, substitute and delete characters in Q and D [TB]. Algorithms exist to compute 
tc(Q, D) in polynomial time [13] but complexity is sensitive to the nature of operations 
and their operands (individual characters, blocks of consecutive characters, etc). The 
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query 
start 



P(q I GE) 




query 
end 



Figure 3.1: Two State Hidden Markov Model 



edit similarity is therefore defined as: 

sim edit (Q,D) 



tc(Q,D) 



(3.13) 



max{|Q|,|D|} 

Edit operations have an associated cost. In the Levenstein edit-distance [13] which we 
will refer to as edit-distance, the cost of copy operation is zero and all other operations 
have unit cost. Other cost models are also possible [13]. 



3.5 Combination Predicates 

We present a general similarity predicate and refer to it as generalized edit similarity 
(GES) (following [5]). Consider two strings Q and D that are tokenized into word tokens 
and a weight function w(t) that assigns a weight to each word token t. The transformation 
cost of string Q to D, tc(Q, D) is the minimum cost of transforming Q to D by a sequence 
of the following transformation operations: 

• token replacement: Replacing word token t\ in Q by word token ti in D with cost 
[1 — sim e dit(ti,t2)] ■ w(t\), where sim ec nt(ti,t2) is the edit similarity score between 
t\ and t2- 
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• token insertion: Inserting a word token t into Q with cost Ci ns ■ w(t) where Ci ns , is 
a constant token insertion factor, with values between and 1. 

• token deletion: Deleting a word token t from Q with cost w(t). 

Suppose wt(Q) is the sum of weights of all word tokens in the string Q. We define the 
generalized edit similarity predicate between a query string Q and a tuple D as follows: 

sim GES {Q, D) — I — min ( ^j^p L0 ) ( 3 - 14 ) 

A related predicate is the SoftTFIDF predicate [7]. In SoftTFIDF, normalized tf-idf 
weights of word tokens are used along with cosine similarity and any other similarity 
function sim(t,r) to find the similarity between word tokens. Therefore the similarity 
score, sim So ftTFiDF(Q, D), is equal to: 

y w(t, Q) ■ f/;(argmax(sM7i(i, r)), D) ■ max(sim(t, r)) (3.15) 

t£CLOSE(0,Q,D) 

where w(t, Q),w(t, D) are the normalized tf-idf weights and CLOSE(9, Q, D) is the set 
of words i 6 Q such that there exists some v G T> such that sim(t, v) > 9. 



Chapter 4 



Declarative Framework 



We now describe declarative realizations of predicates in each class. We present declar- 
ative statements using standard SQL expressions. For all predicates, there is a prepro- 
cessing phase responsible for tokenizing strings in the base relation, R, and calculating 
as well as storing related weight values which are subsequently utilized at query time. 
Tokenization of relation R (BASE_TABLE) creates the table BASE_TOKENS (tid, token), 
where tid is a unique token identifier for each tuple of BASE_TABLE and token an as- 
sociated token (from the set of tokens corresponding to the tuple with identifier tid in 
BASE_TABLE). The query string is also tokenized on the fly (at query time) creating the 
table QUERY_TOKENS (token). 

In the rest of this chapter, we present SQL expressions required for preprocessing and 
query time approximate selections for the different predicates. In some cases, we re-write 
formulas to make them amenable to more efficient declarative realization. The main SQL 
codes are given along with their description here. Appendix A contains detailed SQL 
expressions. 

11 
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INSERT INTO INTERSECT_SCDRES (tid, score) 

SELECT Rl.tid, COUNT (*) 

FROM BASE.TOKENS Rl , QUERY.TOKENS R2 

WHERE Rl. token = R2. token 

GROUP BY Rl.tid 



Figure 4.1: SQL Code for IntersectSize 


INSERT 


INTO JACCARD_SCORES (tid, score) 


SELECT 


SI .tid , COUNTO) / (SI . len+S2 . len-C0UNT(*) ) 


FROM 


BASE_DDL SI, QUERY_TOKENS R2, 




(SELECT COUNT (*) AS len 




FROM QUERY_TOKENS) S2 


WHERE 


SI. token = R2. token 


GROUP BY SI. tid, Sl.len, S2.1en 



Figure 4.2: SQL Code for Jaccard Coefficient 



4.1 Overlap Predicates 

The IntersectSize predicate requires token generation to be completed in a preprocessing 
step. SQL statements to conduct such a tokenization, which is common to all predi- 
cates we discuss, is available in Appendix A. The SQL statement for approximate selec- 
tions with the IntersectSize predicate is shown on Figure 14.11 The Jaccard coefficient 
predicate can be efficiently computed by storing the number of tokens for each tuple 
of the BASE_TABLE during the preprocessing step. For this reason we create a table 
BASE_DDL(tid, token, len) where len is the number of tokens in tuple with tuple-id 
tid. The SQL statement for conducting approximate selections with the Jaccard predi- 
cate is presented in Figure 14.21 

The weighted overlap predicates require calculation and storage of the related weights 
for tokens of the base relation during preprocessing. For the WeightedMatch predicate, 
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INSERT INTO SIM_SCORES (tid, score) 



SELECT 



R1W . tid , SUM (R1W . weight*R2W . weight) 



FROM 



BASEJJEIGHTS R1W, QUERY_WEIGHTS R2W 



WHERE 



R1W. token = R2W. token 



GROUP BY 



RlW.tid 



Figure 4.3: SQL Code for Aggregate Weighted Predicates 



we store during the preprocessing step the weight of each token redundantly with each 
tid, token pair in a table BASE_TOKENS_WEIGHTS(tid, token, weight) in order to 
avoid an extra join with a table BASE_WEIGHT (token, weight) at query time. In order 
to calculate the similarity score at query time, we use SQL statements similar to that 
used for the IntersectSize predicate (shown in Figure ETTI) but replace table BASE_TOKENS 
by BASE_TOKENS_WEIGHTS and COUNTO), by SUM(R1 .weight). 

For the WeightedJaccard predicate, we create during preprocessing a table BASE_DDL(tid, 
token , weight , len) where weight is the weight of token and len is the sum of weights 
of tokens in the tuple with tuple-id tid. The SQL statement for approximate selections 
using this predicate is the same as the one shown in Figure H~2l but COUNT (*) is replaced 
by SUM (weight). 



The SQL implementation of the tf-idf cosine similarity predicate has been presented 
in [12]. During preprocessing, we store tf-idf weights for the base relation in relation 
BASE_WEIGHTS(tid, token, weight). A weight table QUERY_WEIGHTS (token, weight) 
for the query string is created on the fly at query time. The SQL statements in Figure 
14.31 will calculate the similarity score for each tuple of the base table. 



4.2 Aggregate Weighted Predicates 



4.2.1 Tf-idf Cosine Similarity 
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4.2.2 BM25 

Realization of BM25 in SQL involves generation of the table BASE_WEIGHTS (tid , token , 
weight) storing the weights for tokens in each tuple of the base relation. These weights 
(wd(t, D)) consist of two parts that could be considered as modified versions of tf and idf. 
For a complete set of SQL statements implementing the required preprocessing, refer to 
Appendix A. The query weights table QUERY_WEIGHTS (token, weight) can be created 
on the fly using the following sub query: 
(SELECT TF. token, TF.tf * (fc 3 +l) / (fc 3 +TF . tf ) AS weight 
FROM ( SELECT T. token, COUNT (*) AS tf 
FROM QUERY.TOKENS T 
GROUP BY T. token ) TF) 

The SQL statement shown in Figure H~31 will calculate BM25 similarity scores. 

4.3 Language Modeling Predicates 
4.3.1 Language Modeling 

In order to calculate language modeling scores efficiently, we rewrite the formulas and 
finally drop some terms that would not affect the overall accuracy of the metric. Cal- 
culating the values in equations (13.81) and (13.91) is easy. We build the following relations 
during preprocessing: BASE_TF (tid, token, tf) where tf= tf tokened- 
BASE_DL(tid,dl) where dl= dl tid . 

BASE_PML (tid, token, pml) where pml= p mi = tft ° k * n > tid . 
BASE_PAVG (token, pavg) where pavg= p avg (token). 
BASE_FREQ (tid, token, freq) where freq= ft ken,ud- 
BASE_RISK(tid, token, risk) where risk= Rtoken,ud- 
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We omit most of the SQL statements in this chapter for readability. Full SQL state- 
ments are available in Appendix A. In order to improve the performance of the associated 
SQL queries, we rewrite the final score formula of equation (13.61) . as follows: 



P{Q\M D ) 



teQ 



J(l-p(t\M D )) 



x t=f (4-1) 

- p(t\M D )) 

teQ 



We slightly change (14. II) to the following: 



P(Q\M D ) 



.teQ 



Y[(l-p(t\M D )) 

x (4.2) 

J] (l-p(t\M D )) 

teQnv 



This change results in a large performance gain, since the computation is restricted to 
the tokens of the query and the tokens of a tuple (as opposed to the entire set of tokens 
present in the base relation). Experiments demonstrate that accuracy is not considerably 
affected. 

In equation (13. 7p . we only materialize the first part (i.e., values of tokens that are 
present in the tuple D) in the relation BASE_PM during preprocessing (storing the second 
part would result in unnecessary waste of space) . We therefore have to divide all formulas 
that use p(t\Mo) into two parts: one for tokens present in the tuple under consideration 
and one for all other tokens. So we rewrite the first term in equation (14.21) as follows: 

i[p(t\M D )= n p(t\MD)x n p^\ m d) 

teQ teQnv teQ-v 

= n P(t\M D ) x n ^ (4.3) 

teQnv teQ-v 

= n tmo) x 4^% 

teQnv llteQnv cs 

The term Y\ t£ Q in the above formula is constant for any specific query string, so it 
can be dropped, since the goal is to find most similar tuples by ranking them based on 
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INSERT INTO LM_SCDRES (tid, score) 

SELECT Bl.tid2, EXP (Bl. score + B2 . sumcompm) 

FROM (SELECT PI. tid AS tidl, T2.tid AS tid2, 

SUM (LOG (PI. pm)) - SUM (LOG (1 . O-Pl . pm) ) 
- SUM(LOG(Pl.cfcs)) AS score 
FROM BASE_PM PI, QUERY.TOKENS T2 
WHERE PI. token = T2. token 
GROUP BY PI. tid, T2.tid) Bl, 
BASE_SUMCOMPMBASE B2 
WHERE Bl.tidl=B2.tid 



Figure 4.4: SQL Code for Language Modeling 



the similarity scores. Therefore, equation (14. 2 p can be written as follows: 



' p(t\M D ) {(I -p(t\M D )) 



P{Q\M D ) 



teQnv 



\/tev 



teQnv 



TT C A ' (l-p(t\M D )) 

11 CS 



teQnv 



(4.4) 



This transformation allows us to efficiently compute similar tuples by just storing 
p(t\M D ) and & for each pair of t and D. Thus, we create table BASE_PM(tid, token, 
pm, cf cs) where pm = p(token\M t id) and cf cs = c ^ e " as the final result of the prepro- 
cessing step. We also calculate and store the term rivieD^ — v(A^d)) during prepro- 
cessing in relation BASE_SUMCOMPBASE(tid, sumcompm). 

The query-time SQL statement to calculate similarity scores is shown in Figure 14.41 
The subquery in the statement computes the three terms in equation 14.41 that include 
intersection of query and tuple tokens and therefore needs a join between the two to- 
ken tables. The fourth term in the equation is read from the table stored during the 
preprocessing as described above. 
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4.3.2 Hidden Markov Models 

We rewrite equation (13.101) as follows: 



P(Q\D is similar) = (a P{q\GE) + ai P(q\D)) 
qeQ 



n a P(q\GE) 

qeQ 



ai P(q\D) 



a P(q\GE)- 



(4.5) 



For a specific query, the term riggg a oP((l\GE) in the above formula is constant for all 
tuples in the base relation and therefore can be dropped since our goal is to order tuples 
based on similarity to a specific query string. So the modified similarity score will be: 



n aiP{q\D) 



a P( q \GE) 

In Equation 14. 6 1 q G Q changes to q G QnV because P(q\D) = for all q <£T>. Thus we 
can calculate the term (1 + ^p^ge) ) ^ or a ^ tid, token pairs during preprocessing and 
store them as weight in relation BASE_WEIGHTS(tid, token, weight). Notice that the 
term P(q\D) is equal to p m i(q, D) in language modeling; we use a relation BASE_PML(tid, 
token, pml) for it. Calculating P(q\GE) and storing it in relation BASE_PTGE (token, 
ptge) is also fairly simple. The final SQL query for preprocessing and the SQL statements 
for calculating similarity scores, are shown in Figure 14.51 



4.4 Edit-based Predicates 

We use the same declarative framework proposed in [11] for approximate matching based 
on edit-distance. The idea is to use properties of q-grams created from the strings to 
generate a candidate set in a way that no false negatives are guaranteed to exist but the 
set may contain false positives. The set is subsequently filtered by computing the exact 
edit similarity score between the query and the strings in the candidate set. Computing 
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Preprocessing 


INSERT 


INTO BASE_WEIGHTS (t id , token , weight) 


SELECT 


M2.tid, M2. token, 




(1 + (al*M2.pml) / (aO*P2.ptge)) 


FROM 


BASE.PTGE P2, BASE_PML M2 


WHERE 


P2. token = M2. token 


Query 


INSERT 


INTO HMM_SCORES (tid, score) 


SELECT 


Wl.tid, T2.tid, EXP (SUM (LOG (Wl .weight) ) ) 


FROM 


BASE.WEIGHTS Wl , QUERY.TOKENS T2 


WHERE 


Wl. token = T2. token 


GROUP BY T2.tid, Wl.tid 


Figure 4.5: SQL Code for HMM 



the edit similarity score is performed using a UDF. The SQL statements for candidate 
set generation and score calculation are available in [TT| . 



4.5 Combination Predicates 

Since the calculation of the score function for GES (Equation l3.14p between a query string 
and all tuples in a relation could be very expensive, we can first identify a candidate set of 
tuples similar to the methodology used for edit-distance and then use a UDF to compute 
exact scores between the query string and the strings in the candidate set. The elements 
of the candidate set are selected using a threshold 9 and the following score formula which 
ignores the ordering between word tokens. This formula over-estimates sirricEsiQ, D) 

® 

sim J G a S c s ard {Q, D) = — — V w(t) ■ m<vt{-sim Jaccard (t, r) + d g ) (4.7) 
wt{Q) ^ rev q 
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where wt(Q) is the sum of weights of all word tokens in Q, w(t) is the idf weight for 
word token t, q is a positive integer indicating the q-gram length extracted from words 
in order to calculate simj accar( i(t,r) and d q = (1 — 1/ q) is an adjustment term. In 
order to enhance the performance of the operation, we can employ min-wise independent 
permutations [3] to approximate simj accard (ti,t 2 ) in Equation 14.71 Description of min- 
wise independent permutations is beyond the scope of this thesis. This would result in 
substituting sirrij accard with the min-hash similarity sim m f l (ti,t2), which is a provable 
approximation. The resulting metric, GES apx , is shown to be an upper-bound for GES 
in expectation [I]: 



In order to implement the above predicates, we need to preprocess the relation using 
the following methodology: 

• Tokenization in two levels, first tokenizing into words and then tokenizing each 
word into q-grams. Word tokens are stored in relation BASE_TOKENS(tid, token) 
and q-grams are stored in BASE_QGRAMS(t id, token, qgram). 

• Storing idf weights of word tokens in relation BASE_IDF (token, idf) as well as 

the average of idf weights in the base relation to be used as idf weights of unseen 



• Calculating weights related to the similarity employed to compare tokens, i.e., 



number of q-grams for each word token in relation BASE_TOKENSIZE (tid, token, 
len). For GES apx , we have to calculate minhash signatures (required by min-wise 
independent permutations). SQL statements for generating min-hash signatures 
and min-hash similarity scores, sim m h(t,r), are available in Appendix A. 




(4.8) 



tokens. 



sim(t, r). For GES Jaccard employing the Jaccard predicate, this includes storing the 
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INSERT INTO GESAPX_RESULTS(tid, score) 
SELECT MS.tid, 1 . 0/SI . sumidf * 

SUM (I . idf * ( ( (2 . 0/g) *MS .maxsim) + (l-l/g) ) ) 
FROM MAXSIM MS, QUERY_IDF I, SUM.IDF SI 
WHERE MS. token = I. token 
GROUP BY MS.tid, SI. sumidf 

Figure 4.6: SQL Code for GES apx , GES Jaccard 

We omit most of SQL statements inside this chapter. In order to make the presented 
statements more readable, we assume that the following auxiliary relations are available 
to us; in practice, they are calculated on-the-fly as subqueries (refer to Appendix for 
complete queries): 

• QUERY_IDF (token, idf) stores idf weights for each token in the query. Weights 
are retrieved from the base weights relation and the average idf value over all tokens 
in the base relation is used as the weight of query tokens not present in the base 
relation. SUM_IDF (token, sumidf) will store sum of idf weights for query tokens. 

• MAXSIM (tid, token, maxsim) stores the maximum of the similarity scores be- 
tween the tokens in tuple tid and each token in the query. 

The tables above do not have to be computed beforehand, they are rather computed 
on the fly at query execution time. Assuming however they are available, the SQL 
statements for computing the scores for GES apx , GES Jaccar,i are shown in Figure H~6l 

SoftTFIDF can also be implemented similar to GES approximation predicates. Dur- 
ing preprocessing, we need to first tokenize the string into word tokens and store them in 
BASE_TOKENS (tid , token) . Depending on the function used for similarity score between 
word tokens, we may need to tokenize each word token into qgrams as well. We then 
need to store normalized tf-idf weights of tokens in the tuples in the base relation in 
BASE_WEIGHTS(tid, token, weight). 
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Here again, at query time, we assess the final score formula of equation (13.151) . in a 
single SQL statement. For presentation purposes, assume that the following relations 
have been materialized: 

• QUERY _WEIGHTS (token, weight) stores normalized tf-idf weights for each token 
in the query table. 

• CLOSE_SIM_SCORES(tid, tokenl, token2, sim) stores the similarity score of each 
token in the query (token2) with each token of each tuple in the base relation, where 
the score is greater than a threshold 9 [9 specified at query time). Such a score 
could have been computed using a declarative realization of some other similarity 
predicate or a UDF to compute similarity using a string distance scheme (e.g., 
Jaro- Winkler [29]). 

• MAXSIM(tid, token, maxsim) stores the maximum of the sim score for each query 
token among all tids in CLOSE_SIM_SCORES relation. MAXTOKEN(tid, tokenl, 
token2, maxsim) stores argmax rgtid (sim(to/cen2, r)) as well, i.e., the token in 
each tuple in the base relation that has the maximum similarity with a query token 
token2 in CLOSE(8, Q, D) 

Figure |4"7H shows the SQL statement for MAXTOKEN table and the final similarity score 
for SoftTFIDF. 
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INSERT INTO MAXTOKEN (tid, tokenl , token2 ,maxsim) 
SELECT CS.tid, CS. tokenl, 

CS.token2, MS.maxsim 
FROM MAXSIM MS, CLOSE_SIM_SCORES CS 
WHERE CS.tid=MS.tid AND 

CS.token2=MS.token2 AND MS . maxs im=CS . s im 



INSERT INTO Sof tTFIDF_RESULTS (tid, score) 
SELECT TM.tid, SUM (I . weight*WB . weight*TM .maxsim) 
FROM MAXTOKEN TM, 

QUERY_WEIGHTS I, BASE_WEIGHTS WB 
WHERE TM.token2 = I. token AND TM.tid = WB.tid 

AND TM. tokenl = WB. token 
GROUP BY TM.tid 



Figure 4.7: SQL Code for SoftTFIDF - Query time 



Chapter 5 
Evaluation 



We experimentally evaluate the performance of each of the similarity predicates presented 
thus far and compare their accuracy. The choice of the best similarity predicate in terms 
of accuracy highly depends on the type of datasets and errors present in them. The 
choice in terms of performance depends on the characteristics of specific predicates. We 
therefore evaluate the (a) accuracy of predicates using different datasets with different 
error characteristics and the (b) performance by dividing the preprocessing and query 
execution time into various phases to obtain detailed understanding on the relative bene- 
fits and limitations. All our experiments are performed on a desktop PC running MySQL 
server 5.0.16 database system over Windows XP SP2 with Pentium D 3.2GHz CPU and 
2GBs of RAM. 

5.1 Benchmark 

In the absence of a common benchmark for data cleaning, we resort to the definition 
of our own data generation scheme with controlled error. In order to generate datasets 
for our experiments, we modify and significantly enhance the UIS database generator 
which has effectively been used in the past to evaluate duplicate detection algorithms 
[13] . We use the data generator to inject different types and percentages of errors to a 

26 
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clean database of string attributes. We keep track of the source tuple from which the 
erroneous tuples have been generated in order to determine precision and recall required 
to quantify the accuracy of different predicates. The generator allows to create data 
sets of varying sizes and error types, thus is a very flexible tool for our evaluation. The 
data generator accepts clean tuples and generates erroneous duplicates based on a set of 
parameters. Our data generator provides the following parameters to control the error 
injected in the data: 

• the size of the dataset to be generated 

• the fraction of clean tuples to be utilized to generate erroneous duplicates 

• distribution of duplicates: the number of duplicates generated for a clean tuple can 
follow a uniform, Zipfian or Poisson distribution. 

• percentage of erroneous duplicates: the fraction of duplicate tuples in which errors 
are injected by the data generator. 

• extent of error in each erroneous tuple: the percentage of characters that will be 
selected for injecting character edit error (character insertion, deletion, replacement 
or swap) in each tuple selected for error injection. 

• token swap error: the percentage of word pairs that will be swapped in each tuple 
that is selected for error injection. 

We use two different sources of data: a data set consisting of company names and a 
data set consisting of DBLP Titles. Statistical details for the two datasets are shown in 
Table 15.11 For the company names dataset, we also inject domain specific abbreviation 
errors, e.g., replacing Inc. with Incorporated and vice versa. 

For both datasets, we generate different erroneous datasets by varying the parameters 
of the data generator as shown in Table 15.21 
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dataset 


^tuples 


Avg. tuple length 


^words/tuple 


Company Names 


2139 


21.03 


2.92 


DBLP Titles 


10425 


33.55 


4.53 



Table 5.1: Statistics of Clean Datasets 



parameter 


range 


size of dataset 


5k - 100k 


# clean tuples 


500 - 10000 


duplicate distribution 


uniform, Zipfian 


erroneous duplicates 


10% - 90% 


extent of error per tuple 


5% - 30% 


token swap error 


10% - 50% 



Table 5.2: Range of Parameters Used For Erroneous Datasets 



We show accuracy results for 8 different erroneous datasets generated from a data 
set of company names, each containing 5000 tuples generated from 500 clean records, 
with uniform distribution. We choose to limit the size of the data sets to facilitate 
experiments and data collection since each experiment is run multiple times to obtain 
statistical significance. We conducted experiments with data sets of increasing size and 
we observed that the overall accuracy trend presented remains the same. We consider the 
results presented highly representative across erroneous data sets (generated according 
to our methodology) of varying sizes, and duplicate distributions. We classify these 8 
datasets into dirty, medium and low error datasets based on the parameters of data 
generation. We have also generated 5 datasets, each having only one specific type of 
error, in order to evaluate the effect of specific error types. Table 15.31 provides more 
details on the datasets. Table 15.41 shows a sample of duplicates generated by the data 
generator from CU1 and CU5. 
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Class 


Name 


Percentage of 


erroneous 
duplicates 


errors in 
duplicates 


token 
swap 


Abbr. 
error 


Dirty 


CU1 


90 


30 


20 


50 


Dirty 


CU2 


50 


30 


20 


50 


Medium 


CU3 


30 


30 


20 


50 


Medium 


CU4 


10 


30 


20 


50 


Medium 


CU5 


90 


10 


20 


50 


Medium 


CU6 


50 


10 


20 


50 


Low 


CU7 


30 


10 


20 


50 


Low 


CU8 


10 


10 


20 


50 




Fl 


50 








50 




F2 


50 





20 







F3 


50 


10 










F4 


50 


20 










F5 


50 


30 









Table 5.3: Classification of Datasets 
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CU1 




Stsalney Morgan cncorporsated Group 


tl 


jMorgank Stanlwey Grouio Inc. 


tl 


Morgan Stanley Group Inc. 


t\ 


Sanlne Morganj Inocrorpated Group 


tl 


Sgalet Morgan Icnorporated Group 




CU5 


tl 


Morgan Stanle Grop Incorporated 


t 5 


Stalney Morgan Group Inc. 


t 5 


Morgan Stanley Group In. 


t 5 
l A 


Stanley Moragn Grou Inc. 




Morgan Stanley Group Inc. 



Table 5.4: Sample Tuples from CU1 & CU5 Datasets 

5.2 Evaluating Accuracy 

We measure the accuracy of predicates, utilizing known methods from the information re- 
trieval literature in accordance to common practice in IR [27]. We compute the Mean Av- 
erage Precision (MAP) and Mean Maximum F\ scores of the rankings of each dataset im- 
posed by approximate selection queries utilizing our predicates. Average Precision(AP), 
is the average of the precision after each similar record is retrieved, i.e., 

Elixir) * reljr)] 
number of relevant records 

where N is the total number of records returned, r is the rank of the record, i.e., the 
position of the record in the result list sorted by decreasing similarity score, P(r) is the 
precision at rank r, i.e., the ratio of the number of relevant records having rank < r to 
the total number of records having rank < r, and rel(r) is 1 if the record at rank r is 
relevant to the query and otherwise. This measure emphasizes returning more similar 
strings earlier. MAP is the mean AP value over a set of queries. Maximum Fi measure 
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is the maximum Fx score (the harmonic mean of precision and recall) over the ranking 
of records, i.e., 



where Pr(r) and Re(r) are precision and recall values for rank r. Pr(r) is as defined 
above. Re(r) is the ratio of the number of relevant records having rank < r to the total 
number of relevant records. Again, we compute mean maximum F x over a set of queries. 

Our data generation methodology allows to associate easily a clean tuple with all 
erroneous versions of the tuple generated using our data generator. A clean tuple and its 
erroneous duplicates are assigned the same cluster id. Essentially each time we pick a 
tuple from a cluster, using its string attribute as a query we consider all the tuples in 
the same cluster (tuples with the same cluster id) as relevant to this query. For each 
query and a specific predicate, we return a list of tuples sorted in the order of decreasing 
similarity scores. Thus, it is easy to identify relevant and irrelevant records among the 
results returned for a specific query and similarity predicate. In order to maintain our 
evaluation independent of any threshold constants (specified in approximate selection 
predicates) we do not prune this list utilizing thresholds. For each dataset, we compute 
the mean average precision and mean maximum Fx measure over 500 randomly selected 
queries taken from that data set (notice that our query workload contains both clean as 
well as erroneous tuples). Thus, our accuracy results represent the expected behaviour of 
the predicates over queries and thresholds. We report the values for MAP only since the 
results were consistently similar for max Fl measure in all our experiments. 

5.3 Settings 

5.3.1 Choice of Weights for Weighted Overlap Predicates 

Both WeightedMatch (WM) and WeightedJaccard (WJ) predicates require a weighting 
scheme to assign weights to the tokens. It is desirable to use a weighting scheme which 





Chapter 5. Evaluation 



32 



captures the importance of tokens. We experimented with idf and the Robertson-Spark 
Jones (RS) weighting scheme given in Equation 13.51 and found that RS weights lead to 
better accuracy. So in the following discussion, we use RS weights for weighted overlap 
predicates. 

5.3.2 Parameter Settings for Predicates 

For all predicates proposed previously in the literature we set any parameter values they 
require for tuning as suggested in the respective papers. For the predicates presented 
herein for data cleaning tasks, for the case of BM25, we set ki=1.5, k 3 =8 and 6=0.675; 
for HMM, we set a to 0.2, although our experiments show that the accuracy results are 
not very sensitive to the value of ao as long as a reasonable value is chosen (i.e., a value 
not close to or 1). 

The SoftTFIDF predicate requires a similarity predicate over the word tokens. We 
experimented with various similarity predicates like Jaccard, IntersectSize, edit distance, 
Jaro- Winkler, etc. and choose Jaro- Winkler since SoftTFIDF with Jaro- Winkler (STfldf 
w/JW) performs the best. This was also observed in [7]. Two words are similar in Soft- 
TFIDF if their similarity score exceed a given threshold 9. SoftTFIDF with Jaro- Winkler 
performed the best with 0=0.8. Finally, we set Cj ns for GES predicate to 0.5 as proposed 
in [I]. For calculating accuracy, we use the exact GES as shown in Equation 13. 141 We 
remark that we do not prune the results based on any threshold in order to keep the 
evaluation independent of the threshold values. 

5.3.3 Q-gram Generation 

Qgram generation is a common preprocessing step for all predicates. We use an SQL 
statement similar to that presented in [H] to generate q-grams, with a slightly different 
approach. We first insert q—1 special symbols (e.g. $) in place of all whitespaces in each 
string, as well as at the beginning and end of the strings. In this way we can fully capture 
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Type of Error 


Xect 


Jac. 


WM 


WJ 


Cosine, BM25, 
LM, HMM 


ED 


GES 


S Tfldf 

w/JW 


abbr. error (Fl) 


0.94 


0.96 


0.98 


1.0 


1.0 


0.89 


1.0 


1.0 


token swap error (F2) 


1.0 


1.0 


1.0 


1.0 


1.0 


0.77 


0.94 


1.0 



Table 5.5: Accuracy: Abbr. and Token Swap Errors 



all errors caused by different orders of words, e.g., "Department of Computer Science" 
and "Computer Science Department". For qgram generation we also need to have an 
optimal value of qgram size (q). A lower value of q ignores the ordering of characters 
in the string while a higher value can not capture the edit errors. So an optimum value 
is required to capture the edit errors taking in account the ordering of characters in 
the string. The table below shows the accuracy comparison of different qgram based 
predicates (Jaccard, tf-idf (Cosine), HMM and BM25) in the dirty cluster of our data 
sets: 



q 


Jaccard 


Cosine 


HMM 


BM25 


2 


0.736 


0.783 


0.835 


0.840 


3 


0.671 


0.769 


0.807 


0.805 



The trend is similar for other predicates and the accuracy further drops for higher 
values of q. Thus, we set q=2 as it achieves the best accuracy results. 

5.4 Accuracy Results 

In this section we present a detailed comparison of the effectiveness of the similarity 
predicates in capturing the different types of error introduced in the data. 

Abbreviation error: Due to abbreviation errors, a tuple AT&T Incorporated gets 
converted to AT&T Inc. Note that 

Incorporated and Inc are frequent words in the company names database. For the 
query AT&T Incorporated, the unweighted overlap predicates Jaccard (Jac.) and In- 
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tersectSize (Xect) will assign to the tuple IBM Incorporated greater similarity score 
than to the tuple AT&T Inc since they just try to match tuples on the basis of common 
qgrams. Edit distance (ED) will behave similarly since it is cheaper to convert AT&T 
Incorporated to IBM Incorporated than to AT&T Inc. The weight based predicates are 
robust to abbreviation errors since they assign high weights to tokens corresponding to 
rare (important) words e.g. AT&T. Table 15.51 presents the accuracy of the predicates for 
the case of a data set with only abbreviation error (dataset Fl). All other predicates 
WeightedMatch(WM), Weighted Jaccrd(WJ), tf-idf(Cosine), BM25, HMM, Language Mod- 
eling(LM) and SoftTFIDF(STf Idf w/JW) had near perfect accuracy. Similar behaviour is 
observed when the percentage of duplicates and abbreviation error is varied. 

Token swap errors: Due to token swap errors, a tuple Beijing Hotel gets con- 
verted to Hotel Bei j ing. Suppose there is a tuple Bei j ing Labs present in the database, 
where Labs and Hotel are equally important tokens but more frequent than Beijing. 
For a query Beijing Hotel, edit distance and GES will claim Beijing Labs more sim- 
ilar to the query than Hotel Beijing. We remark that for accuracy calculation, we use 
exact GES as shown in Equation 13.141 All other predicates ignore the order of words, 
and hence will perform well for token swap errors. Table 15.51 shows the accuracy of the 
predicates for a data set with only token swap errors (dataset F2). All other predicates 
had near perfect accuracy. Similar trend is observed when the percentage of duplicates 
and token swap error is varied. 

Edit errors: Edit errors involve character insertion/ deletion/ replacement and char- 
acter swap. The number of positions of a string at which edit error has occurred defines 
the extent of the edit error. All the predicates discussed above are robust towards low edit 
errors but their accuracy degrades as the extent of edit error increases. Table 15.61 shows 
the accuracy result for various predicates for increasing edit error in the data (datasets 
F3, F4 and F5). The predicates giving near equal accuracy are grouped together. GES 
is most resilient to edit errors. Edit distance, designed to capture edit errors has aver- 
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Predicate group 


F3 


F4 


F5 


GES 


1.0 


.99 


.97 


BM25, HMM, LM, STfldf w/JW 


1.0 


.97 


.91 


edit distance 


.99 


.97 


.90 


WM , WJ, Cosine 


.99 


.93 


.85 


Jaccard (Jac), IntersectSize (Xect) 


.99 


.91 


.81 



Table 5.6: Accuracy: Only Edit Errors 



age performance. BM25, STfldf w/JW, and probabilistic predicates (LM and HMM) are 
competitive in catching edit errors and perform slightly better than edit distance. The 
weighted overlap predicates (WM and WJ) with RS weights perform equivalent to tf-idf 
(Cosine) but not as good as edit distance. Finally the unweighted overlap predicates 
Jaccard and IntersectSize perform the worst as they ignore the importance of tokens. 
Similar trend is observed when the percentage of erroneous duplicate is varied. 

5.4.1 Comparison of predicates 

Figure EH] shows MAP values for different predicates for the 3 classes of erroneous datasets 
described in Table 15.31 For the low error datasets, all the predicates perform well ex- 
cept edit distance, GES, IntersectSize and Jaccard. GES performs a little worse due to 
the presence of token swap errors, IntersectSize and Jaccard perform worse because of 
abbreviation errors and edit distance is the worst because of both factors. 

When the error increases, the three types of errors occur in combination and edit 
based predicates experience large accuracy degradation. The edit based predicates are 
already not good at handling token swap errors and the presence of edit errors deteriorates 
their effectiveness since the word token weights are no longer valid. This is not the case 
for the qgram based predicates since edit errors affect only a small fraction of qgrams 
and the remaining qgram weights are still valid. Consider a query <5=Morgan Stanley 
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Group Inc. over dataset CU5, where we expect to fetch the tuples shown in Table [5T4l 
The qgram based predicates are able to return all the tuples at the top 5 positions in 
the list according to similarity values. GES is not able to capture the token swap and it 
ranks t% and t\ at position 27 and 28 respectively. The edit distance predicate performs 
worse; both t\ and t\ are absent from the list of top 40 similar tuples. Both edit based 
predicates give high similarity score to tuples like Silicon Valley Group, Inc. for 
query Q primarily because of low edit distance between Stanley and Valley. 

The unweighted overlap predicates ignore the importance of qgrams and hence per- 
form worse than the predicates that incorporate weights. It is interesting to note that 
the weighted overlap predicates perform better than the tf-idf (cosine) predicate. This 
is due to the RS weighting scheme (Equation 13.51) for weight assignment of tokens which 
has been shown more accurate than the idf weighting scheme. The former captures im- 
portance of tokens more accurately than the latter. The language modeling predicates 
(HMM and LM), and BM25 are always the best in all the three datasets. The suc- 
cess of the SoftTFIDF is attributed to the underlying Jaro- Winkler word level similarity 
predicate which can match the words accurately even in the presence of high errors. 

We also experimented with the GES Jaccard and GES apa: . Both predicates make use of 
a threshold 9 to prune irrelevant records without calculating the exact scores. Depending 
on the value of 9, relevant records might also be pruned leading to a drop in accuracy. 
Table I5T71 shows the variation in accuracy for QES Jaccard and GES apx for threshold values 
(9) 0.7, 0.8 and 0.9 for dataset CU1 for which GES (with no threshold) has 69.7% accu- 
racy. For GES ap:r we used 5 min hash signatures in order to approximate the GES Jaccard . 
We observe that increasing the number of min-hash signatures takes more time with- 
out having a significant impact on accuracy (pretty soon it demonstrates diminishing 
returns). A small number of min hash signatures results in significant accuracy loss. 

Experimental results show that for suitable thresholds GES" 7accard performs as good as 
GES and the accuracy drops as the threshold increases. GES apx , being an approximation 
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Predicate 


9 =0.7 


9 =0.8 


9 =0.9 


Qj^^Jaccard 


0.692 


0.683 


0.603 


Q ES apx 


0.678 


0.665 


0.608 



Table 5.7: Accuracy of GES Predicates for Different Thresholds 

for GES Jaccar,i , performs slightly worse than GES Jaccard . Similar results were observed 
for other datasets. 



5.5 Performance Results 

In this section, we compare different predicates based on preprocessing time, query time 
and how well they scale when the size of the base table grows. As expected, the per- 
formance depends primarily on the size of the base table. Performance observations 
and trends remain relatively independent from the error rate of the underlying data sets. 
Thus, we present the experiments on the DBLP datasets with increasing size and medium 
amount of errors: 70% of erroneous duplicates, 20% extent of error, 20% token swap error 
and no abbreviation error. 

5.5.1 Preprocessing 

We divide preprocessing time for a data set to make it amenable for approximate selection 
queries into two phases. In the first phase, tokenization is performed. Qgrams are 
extracted from strings in the way described in section 15.3.31 and stored in related tables. 
Aggregate weighted (Cosine and BM25) and language modeling predicates (LM and 
HMM) are fastest in this phase, followed by overlap predicates (Xect and Jac.) with 
a small difference which is due to storing distinct tokens only. Combination predicates 
(GES Jac, GES apx and STf Idf w/JW) are considerably slower in this phase since they 
involve an extra level of tokenization into words. 
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In the next phase, related weights are calculated and assigned to tokens. In this 
phase, the fastest predicates are the overlap predicates and edit distance (ED) followed 
by QES Jaccard and SoftTFIDF that only require weight calculation for word tokens. 
Aggregate weighted and language modeling predicates are considerably slower since cal- 
culating weights in these predicates involves a lot of computation and creation of many 
intermediate tables. Language modeling (LM) is the slowest predicate among probabilis- 
tic predicates since it requires the maximum number of intermediate tables to be created 
and stored. GES apx requires to compute min-hash signatures for the tokens separately 
for a number of hash functions on top of the two level tokenization and IDF weight cal- 
culation, so it is the slowest of all predicates. Figure 15.21 shows the preprocessing times 
for all predicates on a dataset of 10,000 records with an average length of 37 charac- 
ters. GES apx in this Figure employs min-hash computation utilizing 5 hash functions 
(min hash signature size of 5). Preprocessing time for GES apx increases with increasing 
number of hash functions employed for min-hash signature calculation. 

5.5.2 Query time 

Query time for a predicate is the time taken to rank the tuples from the base table ac- 
cording to decreasing similarity score. Query time can also be divided into two phases: 
preprocessing the query string and computing similarity scores. The preprocessing part 
can itself be divided into tokenization and weights computation phases as done for prepro- 
cessing of the base relation. We didn't experience large variability in the time for query 
preprocessing among all predicates. As described in section 14.31 the score formulas for 
Language modeling and HMM are suitably modified by dropping query dependent terms 
which do not alter the similarity score order and hence, the accuracy of the predicates. 

Figure 15.31 shows the average query execution time of different predicates over 100 
queries on a table of 10,000 strings with an average length of 37 characters. The experi- 
mental results are consistent with our analysis. A comparison of the average query time 



Chapter 5. Evaluation 



39 



of the predicates shows that IntersectSize, Jaccard, WeightedMatch, WeightedJaccard, 
HMM, BM25 should be among the best since first, they just involve one join and second, 
the query token weights do not depend on idf and are easy to compute. We expect the 
Cosine predicate to follow these predicates as it has the additional overhead of calculating 
query weights which depend on idf of tokens. The Language Modeling predicate involves 
join of 3 tables, so it is comparatively slow. The GES based predicates are slowest of all 
since they involve identification of the best matching token among the tuples for each 
query token. GES apx has been designed to efficiently approximate GES Jaccard , so it is 
expected to be the fastest of all GES based predicates. Note that the filtering step of 
GES Jaccar<i , GES apx and edit distance require a suitable threshold 6. Lower value of 6 
results in poor filtering and high post-processing time, while higher value of 9 leads to 
loss of similar results and hence a drop in accuracy. We used #=0.8 for the filtering step 
in GES Jacmrd and GES apx and #=0.7 for edit distance, since these values balance the 
trade-off between the performance and precision for these predicates. For GES apx , we 
use 5 hash functions for min-hash calculation (min hash signature of 5). 



5.5.3 Scalability 

In order to investigate the scalability of our approach, we run experiments on DBLP 
datasets with sizes varying from 10k to 100k records. The variation in query time as the 
base table size increases is shown in Figure 15.41 The predicates with nearly equal query 
execution times have been grouped together. Group Gl includes predicates IntersectSize, 
WeightedMatch and HMM, and the group G2 includes Jaccard, WeightedJaccard, Cosine 
and BM25. For predicates other than combination predicates, the results are consistent 
with our analysis of query execution time presented in Section 15.5.21 The predicates in 
group Gl can be thought of having a weight of 1 for query tokens and they just require 
a single join to compute similar tuples. The predicates in group G2 take slightly more 
time than predicates in Gl since they have to calculate weights for query tokens. LM 
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requires join of three tables to get results so it is considerably slower than predicates in 
Gl and G2. For the case of combination predicates, query time depends highly on the 
value of threshold 9 used for these predicates and the number of words in the string. We 
use the same thresholds we used in Section 15.5.21 for these predicates. We also limit the 
size of the query strings to three words in order to be able to compare the values among 
different datasets with other predicates. The results show that combination predicates are 
significantly slower than other predicates since for each query token, we need to determine 
the best matching token from the base tuple using an auxiliary similarity function such 
as Jaccard and Jaro- Winkler, apart from the time needed to calculate related weights for 
word tokens. GES apx is the fastest in this cluster of predicates. Increasing the number 
of words in query strings considerably slows down these predicates. We excluded edit 
distance from this experiment because of its significantly poor accuracy. 

5.6 Performance Enhancements 

Apart from obvious ways of boosting the performance of algorithms such as modifying 
score formulas as described in Section H] and building indices on relations to improve 
execution plans of score calculation formulas, it is possible to enhance performance of 
the algorithms by using filtering and pruning methods. Filtering based methods try to 
find a set of tuples which are promising duplicates by dropping a considerable percent- 
age of dissimilar records without calculating exact scores. GES Jaccard j s an example of 
such techniques. Filtering based enhancement techniques for declarative framework are 
described in detail in [T6] . 

Pruning methods enhance the performance of algorithms mainly based on the nature 
of q-grams made out of strings. As results of our experiments suggest, in every data 
set, there is a huge number of qgrams that play very little or no role in the accuracy of 
predicates. Therefore, a very reasonable policy is to drop those q-grams in favor of space 
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and running time. One way to do so is to prune weight tables by dropping tokens with 
weights less than a threshold during the preprocessing or the query execution phase. This 
method can be effective in some predicates, however, there are two problems associated 
with this approach. The first problem is to find the best threshold which depends on the 
data set and the predicate. The other is related to the characteristics of some predicates, 
specially language modeling and HMM predicates, where dropping tokens will ruin the 
probability distributions calculated for tokens and as a result, the score formulas would 
fail to calculate correct similarity scores. 

A better strategy that is shown to be very effective in our experiments is to prune 
base relation's tokens table based on IDF of tokens. It is analogous to the idea of 
removing stopwords e.g. the, an etc. from the documents for efficient keyword search. 
This approach has several advantages. Although some extra steps will be added to 
the preprocessing phase, as a results of pruning, other steps of preprocessing will be 
considerably faster and overall, we gain substantial performance improvement in the 
preprocessing phase (except for unweighted naive predicates where there is no step other 
than preprocessing). Since all weights are calculated from the pruned tokens table, 
the probability distributions of tokens will remain meaningful. The benefit for query 
execution time and the effect on accuracy depends on the threshold used for pruning. 

Figure [5751 shows the effect of threshold used for pruning on MAP and execution time 
for a dirty data set of company names. We use a threshold in the form of MIN(idf) + 
rate * (MAX(idf) — MIN(idf)) and change the rate from (i.e., no pruning) to 0.5. 
As we can see in the figure, in the unweighted naive predicates, pruning results in a 
considerable gain in accuracy. This is obviously due to dropping tokens with low IDF 
from strings, so the similarity score for these predicates will be intersection and Jaccard 
coefficient of only important tokens. Interestingly, with low threshold values, accuracy 
of all other predicates benefit from pruning. This shows that tokens with very low IDF 
do not play an effective role in the similarity scores. For our example data set, a pruning 
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rate between 0.2 and 0.3 has the best effect on both accuracy and a remarkable effect on 
performance, while rate up to 0.4 results in reasonable drop in accuracy and little more 
gain in performance. 

To further investigate the effect of pruning based on IDF on performance and accuracy 
of predicates, we examine the IDF distribution in our data sets. Figure 15.61 shows the 
distribution of IDF weights for CU1 data set. The distribution is similar in all other data 
sets. As it can be seen, there is a huge number of tokens with low IDF. For this data set, 
a pruning rate of 0.33 will drop nearly 150,000 out of 250,000 tokens which results in a 
huge performance gain and a very little drop in accuracy of predicates (except unweighted 
naive predicates that benefit from pruning as described above). 

5.7 Summary of Evaluation 

We presented an exhaustive evaluation of approximate selection predicates by group- 
ing them into five classes based on their characteristics: overlap predicates, aggregate 
weighted predicates, edit-based predicates, combination predicates and language model- 
ing predicates. We experimentally show how predicates in each of these classes perform 
in terms of accuracy, preprocessing and execution time. Within our framework, the 
overlap predicates are relatively efficient but have low accuracy. Edit based predicates 
perform worse in terms of accuracy but are relatively fast due to the filtering step they 
employ. The aggregate weighted predicates, specifically BM25, perform very well both 
in terms of accuracy and efficiency Both the predicates from the language modeling 
cluster perform well in terms of accuracy. Moreover, HMM is as fast as simple overlap 
predicates. The combination predicates are considerably slow due to their two levels of 
tokenization. Among the combination predicates, GES based predicates are robust in 
handling edit errors but fail considerably in capturing token swap errors. SoftTFIDF 
with Jaro- Winkler performs nearly equal to BM25 and HMM and is among the best in 
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terms of accuracy, although it is the slowest predicate. This establishes the effectiveness 
of BM25 and HMM predicates for approximate matching in large databases. 



Chapter 5. Evaluation 44 



0.9 - 


0.7 - 
0.6 
0.5 - 
0.4 



0.3 - 


Xect 


Jac. 


WM 


WJ 


Cosine 


BM25 


LM 


HMM 


ED 


GES 


STfldf 

w/JW 


Clean 


0.9273 


0.9506 


0.9803 


0.9908 


0.9877 


0.9893 


0.9896 


0.9891 


0.8100 


0.9323 


0.9739 



fa) Low Error Datasets 



1.0 1 
0.9 
0.8 
0.7 

0.6 f- 
0.5 
0.4 - 
0.3 



Cosine BM25 LM HMM ED GES 



0.9337 0.9510 0.9564 0.9520 0.6766 0.8542 0.9522 



(b) Medium Error Datasets 



1.0 
0.9 
0.8 
0.7 
0.6 
0.5 
0.4 
0.3 



[Dirty 



I . I 



Xect Jac. WM 



0.7074 0.7361 0.7932 



WJ Cosine BM25 



0.8095 0.7835 0.8408 



LM 



0.8360 



HMM ED GES 



0.8356 0.4910 0.6644 



STfldf 

w/JW 



0.8502 



(c) Dirty Datasets 
Figure 5.1: MAP values for different predicates on different datasets 
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Figure 5.2: Preprocessing time of different predicates 
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Figure 5.3: Query time of different predicates 
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Figure 5.5: Effect of Pruning on MAP and Execution Time of Different predicates 
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Conclusions 



We proposed new similarity predicates for approximate selections based on probabilistic 
information retrieval and presented their declarative instantiation. We presented an 
in-depth comparison of accuracy and performance of these new predicates along with 
existing predicates, grouping them into classes based on their primary characteristics. 
Our experiments show that the new predicates are both effective as well as efficient for 
data cleaning applications. 
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Appendix A 

Data preparation SQL Statements 



We assume the base relation BASE_TABLE has an integer tuple id attribute tid and a 
string valued attribute string. The following SQL statements tokenize the base relation, 
creating BASE_TOKENS(tid, token). Assuming that the query relation QUERY_TABLE 
has a single string valued attribute string, the same SQL statements can be used for 
tokenization of the query string by removing tid from the statements. 



A.l Qgram generation 



— MAX_STR_SIZE is the maximum string length and q is the size of the qgrams . 
INSERT INTO INTEGERS (i) VALUES (1), (2), (MAX_STR_SIZE + (<?-l)) 

INSERT INTO BASE_TOKENS (tid , token) 

SELECT tid, SUBSTRING(CONCAT(SUBSTRING( '$...$', 1 ,<j-l) , 

UPPER( REPLACE (C0NCAT( string) , ' ' ,SUBSTRING( '$...$', 1 ,<j-l) )) , 
SUBSTRING( *$...$' .i.g-l)) , INTEGERS. i, q) 

FROM INTEGERS INNER JOIN BASE.TABLE ON 

INTEGERS. i <= LENGTH ( REPLACE(CONCAT(string) , ' ', SUBSTRING( '$...$', 1 ,q-l) ) ) + (<j-l) 
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Appendix A. Data preparation SQL Statements 
A. 2 Word token generation 
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INSERT 


INTO BASE_TOKENS (t id , token) 




SELECT 


tid, SUBSTRING (C0NCAT( string) , 1, LOCATE (' ', CONCAT ( string) ) - 


1) 


FROM 


BASE.TABLE 




WHERE 


LOCATE ( ' CONCAT (string)) > 




UNION 


ALL 




SELECT 


tid, SUBSTRING (CONCAT ( string) , Nl.I+1, N2 . I - Nl.I-1) 




FROM 


BASE.TABLE, INTEGERS Nl, INTEGERS N2 




WHERE 


Nl.I = LOCATE (' ' , CONCAT ( string) , Nl.I) AND N2 . I = LOCATE (' ' , 


CONCAT(string) , Nl.I + 1) 


UNION 


ALL 




SELECT 


tid, SUBSTRING (CONCAT ( string) , LENGTH(CONCAT(string) ) - LOCATE ( 


', REVERSE(C0NCAT(string)))+2) 


FROM 


BASE.TABLE 




WHERE 


LOCATE ( ' ', CONCAT(string)) > 




UNION 


ALL 




SELECT 


tid, CONCAT(string) 




FROM 


BASE.TABLE 




WHERE 


LOCATE ( 1 ', CONCAT(string)) =0 





A. 3 Qgram generation of the word tokens (for com- 



bination predicates) 


INSERT INTO BASE_QGRAMS (t id , token, qgram) 




SELECT tid, token, 




SUBSTRING(CONCAT(SUBSTRING('$. . . $ 1 , 1 , q-1) , UPPER(token) , SUBSTRING ( 1 $ . . . 


$',1,<J-D), INTEGERS. I, q) 


FROM INTEGERS INNER JOIN BASE.TOKENS ON INTEGERS . I <= LENGTH (token) + (q-1) 




GROUP BY tid, token, qgram 





Appendix B 



SQL Statements for Predicates 



B.l Overlap Predicates 



B.l.l IntersectSize 



Query 

INSERT INTO I NTERSECT_RESULTS ( t i d , score) 

SELECT Rl.tid, C0UNT(*) 

FROM BASE.TOKENS Rl, QUERY.TOKENS R2 

WHERE Rl. token = R2. token 

GROUP BY Rl.tid 
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Appendix B. SQL Statements for Predicates 
B.1.2 Jaccard 



Preprocessing 


INSERT 


INTO BASE_DDL(tid, ddl) 




SELECT 


T.tid, C0UNT(*) 




FROM 


BASE.TOKENS T 




GROUP BY T.tid 




INSERT 


INTO BASE_TOKENSDDL(tid, token, ddl) 




SELECT 


T.tid, T. token, D.ddl 




FROM 


BASE.TOKENS T, BASE_DDL D 




WHERE 


T.tid = D.tid 




Query 


INSERT 


INTO JACCARD_RESULTS(tid, score) 




SELECT 


Sl.tid, COUNT (*)/ (SI. ddl + S2.ddl - COUNT(* 


)) 


FROM 


BASE.TOKENSDDL SI, QUERY.TOKENS R2, (SELECT 


COUNT (*) AS ddl FROM QUERY.TOKENS T) S2 


WHERE 


SI. token = R2. token 




GROUP BY Sl.tid 





Appendix B. SQL Statements for Predicates 
B.1.3 WeightedMatch 



Preprocessing 


INSERT 


INTO BASE.SIZE(size) 


SELECT 


COUNT (*) 


FROM 


BASE.TABLE 


INSERT 


INTO BASE_TF(tid, token, tf) 


SELECT 


T.tid, T. token, COUNT(*) 


FROM 


BASE.TOKENS T 


GROUP BY T.tid, T. token 


INSERT 


INTO BASE.BMIDF (token, midf) 


SELECT 


T. token, L0G(S.SIZE - COUNT(T.tid) + 0.5) - LOG (COUNT (T . tid) + 0.5) 


FROM 


BASE.TF T, BASE.SIZE S 


GROUP BY T. token 


INSERT 


INTO BASE_WEIGHTS (tid, token, weight) 


SELECT 


T.tid, T. token, I. midf 


FROM 


BASE.BMIDF I, BASE.TF T 


WHERE 


I. token = T. token 


Query 


INSERT 


INTO WEIGHTEDMATCH_RESULTS (t idl , tid2, score) 


SELECT 


Wl.tid, T2.tid, SUM(W1 .weight) 


FROM 


BASE.WEIGHTS Wl, QUERY.TOKENS T2 


WHERE 


Wl. token = T2. token 


GROUP BY T2.tid, Wl.tid 



Appendix B. SQL Statements for Predicates 



B.1.4 Weighted Jaccard 



Preprocessing 


INSERT 


INTO BASE_SIZE(size) 




SELECT 




COUNT (*) 




FROM 




BASE.TABLE 




INSERT 


INTO BASE_TF(tid, token, tf) 




SELECT 




T.tid, T. token, COUNT(*) 




FROM 




BASE.TOKENS T 




GROUP BY 


T.tid, T. token 




INSERT 


INTO BASE.BMIDF (token, midf) 




SELECT 




T. token, L0G(S.SIZE - COUNT(T.tid) + 0.5) 


- LOG (COUNT (T . tid) + 0.5) 


FROM 




BASE.TF T, BASE.SIZE S 




GROUP BY 


T. token 




INSERT 


INTO BASE_WEIGHTS(tid, token, weight) 




SELECT 




T.tid, T. token, I. midf 




FROM 




BASE.BMIDF I, BASE.TOKENS T 




WHERE 




I. token = T. token 




INSERT 


INTO BASE_DDL(tid, ddl) 




SELECT 




W.tid, SUM(weight) 




FROM 




BASE.WEIGHTS W 




GROUP BY 


W.tid 




INSERT 


INTO BASE_TOKENSDDL(tid, token, ddl, weight) 




SELECT 




W.tid, W. token, D . DDL , W. WEIGHT 




FROM 




BASE.WEIGHTS W, BASE.DDL D 




WHERE 




W.tid = D.tid 




Query 


INSERT 


INTO WJ_RESULTS(tid, score) 




SELECT 




Sl.tid, SUM(Sl.weight)/(Sl.ddl + S2.ddl - 


SUM(S1. weight)) 


FROM 




BASE.TOKENSDDL SI, QUERY.TOKENS R2, 








( SELECT SUM (T. weight) AS ddl 








FROM (SELECT T. token, I . IDF AS weight 








FROM BASE.IDF I, QUERY.TOKENS T 








WHERE I. token = T. token) T ) S2 




WHERE 




SI. token = R2. token 




GROUP 




BY Sl.tid 
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B.2 Aggregate Weighted Predicates 



B.2.1 Tfidf Cosine Predicate 



Preprocessing 


INSERT INTO BASE.SIZE(size) 


SELECT 


COUNT (*) 


FROM 


BASE.TABLE 


INSERT INTO BASE.IDF (token, idf) 


SELECT 


T. token, LOG(S.SIZE) - LOG(COUNT(DISTINCT T.tid)) 


FROM 


BASE.TOKENS T, BASE.SIZE S 


GROUP BY 


T. token 


INSERT INTO BASE_TF(tid, token, tf) 


SELECT 


T.tid, T. token, COUNT(*) 


FROM 


BASE.TOKENS T 


GROUP BY 


T.tid, T. token 


INSERT INTO BASE_LENGTH(tid, len) 


SELECT 


T.tid, SQRT(SUM(I.idf *I.idf *T.tf *T.tf )) 


FROM 


BASE.IDF I, BASE.TF T 


WHERE 


I. token = T. token 


GROUP BY 


T.tid 


INSERT INTO BASE.WEIGHTS (tid, token, weight) 


SELECT 


T.tid, T. token, I . idf *T. tf /L . len 


FROM 


BASE.IDF I, BASE.TF T, BASE_LENGTH L 


WHERE 


I. token = T. token AND T.tid = L.tid 



Appendix B. SQL Statements for Predicates 



Query 

INSERT INTO COSINE_RESULTS(tid, score) 

SELECT RlW.tid, SUM(RlW.weight*R2W. weight) 

FROM BASE.WEIGHTS R1W, 

(SELECT T. token, QIDF. idf *QTF.tf /QLEN. length AS weight 
FROM (SELECT R. token, R.idf 

FROM QUERY.TOKENS S, BASE.IDF R 

WHERE S. token = R. token 

GROUP BY S. token) QIDF, 

(SELECT T. token, C0UNT(*) AS tf 

FROM QUERY.TOKENS T 

GROUP BY T.tid, T. token) QTF, 

(SELECT SQRT (SUM (QIDF . idf *QIDF. idf *QTF . tf *QTF . tf ) ) AS length 
FROM (SELECT R. token, R.idf 

FROM QUERY.TOKENS S, BASE.IDF R 
WHERE S. token = R. token 
GROUP BY S. token) QIDF, 
(SELECT T. token, C0UNT(*) AS tf 
FROM QUERY.TOKENS T 
GROUP BY T. token) QTF 
WHERE I. token = T. token) QLEN 
WHERE QIDF. token = QTF. token) R2W 
WHERE R1W. token = R2W. token 

GROUP BY RlW.tid 
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B.2.2 


BM25 Predicate 


Preprocessing 


INSERT INTO BASE_SIZE(size) 


SELECT 


COUNT (*) 


FROM 


BASE.TABLE 


INSERT INTO BASE_TF(tid, token, tf) 


SELECT 


T.tid, T. token, COUNT(*) 


FROM 


BASE.TOKENS T 


GROUP BY 


T.tid, T. token 


INSERT INTO BASE.BMIDF (token, midf) 


SELECT 


T. token, L0G(S.SIZE - COUNT(T.tid) + 0.5) - LOG (COUNT (T . tid) + 0.5) 


FROM 


BASE.TF T, BASE.SIZE S 


GROUP BY 


T. token 


INSERT INTO BASE_BMBASELENGTH (tid , len) 


SELECT 


T.tid, SUM(T.tf) 


FROM 


BASE.TF T 


GROUP BY 


T.tid 


INSERT INTO BASE_BMBASEAVGLENGTH(avglen) 


SELECT 


AVG(len) 


FROM 


BASE_BMBASELENGTH 


INSERT INTO BASE_BMBASEMODTF(tid, token, mtf) 


SELECT 


T.tid, T. token, (T.tf * (fci+1) ) / ( ((( 1 - 6) + (6*L.DL/A.AVGDL))*fci ) + T.tf ) 


FROM 


BASE.BMBASELENGTH L, BASEJ3MBASEAVGLENGTH A , BASE_TF T 


WHERE 


L.tid = T.tid 


INSERT INTO BASE_BMBASEWEIGHTS (tid, token, weight) 


SELECT 


T.tid, T. token, T .mtf *I .midf 


FROM 


BASE_BMBASEMODTF T, BASE.BMIDF I 


WHERE 


T. token = I. token 


Query 


INSERT INTO BM25_RESULTS(tid, score) 


SELECT 


B.tid, SUM(B. weight * S.mtf) 


FROM 


BASE.BMBASEWEIGHTS B, 




(SELECT token, (COUNT (*))*(fc 3 +l) / (fc 3 +C0UNT(*) ) AS mtf 




FROM QUERY.TOKENS T 




GROUP BY T. token) S 


WHERE 


B. token = S. token 


GROUP BY 


B.tid 
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B.3 Language Modeling Predicates 



B.3.1 


Language Modeling 


Preprocessing 


INSERT INTO BASE_TF(tid, token, tf) 


SELECT 


T.tid, T. token, COUNT(*) 


FROM 


BASE.TOKENS T 


GROUP BY 


T.tid, T. token 


INSERT INTO BASE_DL(tid, dl) 


SELECT 


T.tid, C0UNT(*) 


FROM 


BASE.TOKENS T 


GROUP BY 


T.TI 


INSERT INTO BASE_PML(tid, token, pml) 


SELECT 


T.tid, T. token, T.tf/D.dl 


FROM 


BASE.TF T, BASE.DL D 


WHERE 


T.tid=D.tid 


INSERT INTO BASE_PAVG(tid, token, pavg) 


SELECT 


P. token, AVG(P.pml) 


FROM 


BASE.PML P 


GROUP BY 


P . token 


INSERT INTO BASE_FREQ(tid, token, freq) 


SELECT 


T.tid, T. token, P.pavg*D.dl 


FROM 


BASE.TF T, BASE_PAVG P, BASEJ5L D 


WHERE 


T. token = P. token AND T.tid=D.tid 


INSERT INTO BASE_RISK(tid, token, risk) 


SELECT 


T.tid, T. token, (1 .0/(1 .O+q .freq) ) * (P0WER(Q . freq/ (1 . O+Q . freq) , T.tf)) 


FROM 


BASE.TF T, BASE_FREQ Q 


WHERE 


T.tid=Q.tid AND T. token = Q. token 


INSERT INTO BASE_TSIZE(size) 


SELECT 


COUNT (*) 


FROM 


BASE.TOKENS 


INSERT INTO BASE.CFCS (token, cfcs) 


SELECT 


T. token, C0UNT(*) / S.size 


FROM 


BASE.TOKENS T, BASE.TSIZE S 


GROUP BY 


T. token 
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INSERT 


INTO BASE_PM(tid, token, pm) 




SELECT 


T.tid, T. token, 1EO* POWER (M.pml, l.O-R.risk) * POWER(A 


pavg, R.risk), C.cfcs 


FROM 


BASE.TF T, BASE.RISK R, BASE.PML M, BASE_PAVG A, BASE.CFCS C, BASE.TSIZE S 


WHERE 


T.tid=R.tid AND T. token = R. token AND T.tid=M.tid AND 






T. token = M. token AND T. token = A. token AND T. token = C 


token 


INSERT 


INTO BASE_SUMCOMPMBASE (t id , sumcompm) 




SELECT 


P.tid, SUM(L0G(1.0-P.pm)) 




FROM 


BASE.PM P 




GROUP BY P.tid 




Query 


INSERT 


INTO LM_RESULTS (tid, score) 




SELECT 


Bl.tid, Bl. score + B2 . sumcompm 




FROM 


(SELECT PI. tid, SUM(L0G(P1 .pm) ) - SUM(L0G(1 . 0-P1 .pm) ) - 


SUM(LOG(Pl.cfcs)) AS score 




FROM BASE.PM PI, QUERY.TOKENS T2 






WHERE PI. token = T2. token 






GROUP BY PI. tid) Bl, 






BASE_SUMCOMPMBASE B2 




WHERE 


Bl .tid=B2.tid 
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B.3.2 Hidden Markov Models 



Preprocessing 


INSERT INTO BASE_SIZE(size) 
SELECT COUNT (*) 
FROM BASE.TABLE 




INSERT INTO BASE_TF(tid, tf) 
SELECT T.tid, T. token, C0UNT(*) 
FROM BASE.TOKENS T 
GROUP BY T.tid, T. token 




INSERT INTO BASE_DL(tid, dl) 
SELECT T.tid, C0UNT(*) 
FROM BASE.TOKENS T 
GROUP BY T.tid 




INSERT INTO BASE_PML(tid, token, pml) 

SELECT T.tid, T. token, F.tf/D.dl 

FROM BASE.TOKENS T, BASE.TF F, BASEJ3L D 

WHERE F.tid=T.tid AND T. token = F. token AND T 


.tid=D.tid 


INSERT INTO BASE_SUMDL(sdl) 
SELECT SUM(T.dl) 
FROM BASE.DL T 




INSERT INTO BASE_PTGE (token, ptge) 
SELECT T. token, SUM(T . tf ) /D. sdl 
FROM BASE.TF T, BASE.SUMDL D 
GROUP BY T. token 




INSERT INTO BASE_WEIGHTSHMM(tid, token, weight) 
SELECT M.tid, M. token, LOG( (1 + (ai*M.pml) / 
FROM BASE.PTGE P, BASE_PML M 
WHERE P. token = M. token 
GROUP BY tid, token 


(a2*P.ptge)) ) 


Query 


INSERT INTO HMM_SCORES (t id , score) 
SELECT Wl.tid, EXP(SUM(W1 .weight) ) 
FROM BASE.WEIGHTS Wl, QUERY.TOKENS T2 
WHERE Wl. token = T2. token 
GROUP BY Wl.tid 
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B.4 Combination Predicates 



B.4.1 GES Jaccard 



Preprocessing 


INSERT 


INTO BASE_SIZE(size) 


SELECT 


COUNT (*) 


FROM 


BASE.TABLE 


INSERT 


INTO BASE.IDF (token, idf) 


SELECT 


T. token, LOG(S.size) - LOG(COUNT(DISTINCT T.tid)) 


FROM 


BASE.TOKENS T, BASE.SIZE S 


GROUP BY T. token 


INSERT 


INTO BASE.IDFAVG(idfavg) 


SELECT 


AVGQ.idf ) 


FROM 


BASE.IDF I 


INSERT 


INTO BASE_TOKENSIZE (t id , token, size) 


SELECT 


T.tid, T. token, COUNT(*) 


FROM 


BASE.QGRAMS T 


GROUP BY T.tid, T.TOKE 


INSERT 


INTO BASE_QGRAMSTOKENSIZE (t id , token, qgram, size) 


SELECT 


T.tid, T. token, T. qgram, S.size 


FROM 


BASE.QGRAMS T, BASE.TOKENSIZE S 


WHERE 


T.tid = S.tid AND T. token = S. token 
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68 



Query (Filtering Step) 


SELECT 


MAXSIM.tid, R. string 






(1.0 - 1.0/c?) + (l/SUM(QIDF.idf )) * SUM(QIDF.idf * (((2.0/g) 


* MAXSIM. maxsim) ) ) AS score 


FROM 


( SELECT JACSIM.tid, JAC_SIM.token2, MAX(sim) AS maxsim 






FROM (SELECT BSIZE.tid AS tid, BSIZE. token AS tokenl , Q. token 


AS token2, 




COUNT(*)/(BSIZE.size + QSIZE. size - C0UNT(*)) AS 


sim 




FROM BASE.QGRAMSTOKENSIZE BSIZE, QUERY.QGRAMS Q, 






(SELECT T. token, C0UNT(*) AS size 






FROM QUERY.QGRAMS T 






GROUP BY T. token) QSIZE 






WHERE BSIZE. qgram = Q.qgram AND Q. token = QSIZE. token 






GROUP BY BSIZE.tid, BSIZE. token, Q. token) JAC.SIM 






GROUP BY JACSIM.tid, JAC_SIM.token2 ) MAXSIM, 






(SELECT R. token, R.idf 






FROM QUERY.TOKENS S, BASE.IDF R 






WHERE S. token = R. token 






GROUP BY S. token 






UNION 






SELECT S. token, A.IDFAVG AS idf 






FROM QUERY.TOKENS S, BASE.IDFAVG A 






WHERE S. token NOT IN (SELECT I. token FROM BASE.IDF I) 






GROUP BY S. token) QIDF, 






BASE.TABLE R 




WHERE 


TM.token2 = I. token AND R.tid = MAXSIM.tid 




GROUP BY 


TM.tid 




HAVING 


score >= 9 
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B.4.2 GES apx 



Preprocessing 


INSERT 


INTO BASE_SIZE(size) 


SELECT 


COUNT (*) 


FROM 


BASE.TABLE 


INSERT 


INTO BASE.IDF (token, size) 


SELECT 


T. token, LOG(S.size) - LOG(COUNT(DISTINCT T.tid)) 


FROM 


BASE.TOKENS T, BASE.SIZE S 


GROUP BY T. token 


INSERT 


INTO BASE.IDFAVG(idfavg) 


SELECT 


AVG(I.idf) 


FROM 


BASE.IDF I 


INSERT 


INTO B ASE_HASHFUNC (fid, func) 


SELECT 


N.i-1, round (rand ()*MAXINT) 


FROM 


INTEGERS N 


LIMIT 


HASH.SIZE 


INSERT 


INTO BASE_HASHVALUE(f id, qgram, value) 


SELECT 


F.FID, Q. QGRAM, MOD(CONV(HEX( Q. qgram), 16, 10) * MAXINT, F.func) 


FROM 


BASE.HASHFUNC F, (SELECT DISTINCT QGRAM FROM BASE.QGRAMS) Q 


INSERT 


INTO BASE_MINHASHSIGNATURE (t id , token, fid, score) 


SELECT 


Q.tid, Q. token, H.fid, MIN(H. value) 


FROM 


BASE.QGRAMS Q, BASE.HASHVALUE H 


WHERE 


Q. QGRAM = H. QGRAM 


GROUP BY Q.tid, Q. token, H.fid 
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Query (Filtering Step) 


SELECT 


MAXSIM.tid, R. string, 




(1.0 - 1.0/cj) + (l/SUM(I.idf )) * SUM(I.idf * (((2.0/g) * MAXSIM .maxsim) )) AS score 


FROM 


(SELECT MH.SIM.tid, MH_SIM.token2, MAX(sim) AS maxsim 




FROM (SELECT BMHSIG.tid AS tid, BMHSIG. token AS tokenl , 




QMHSIG. token AS token2, C0UNT(*)/H AS sim 




FROM BASE_MINHASHSIGNATURE BMHSIG, 




(SELECT Q. token, H.fid, MIN(H. value) AS value 




FROM QUERY.QGRAMS Q, BASEJJASHVALUE H 




WHERE Q.qgram = H.qgram 




GROUP BY Q. token, H.fid) QMHSIG 




WHERE BMHSIG. fid = QMHSIG. fid AND BMHSIG. value = QMHSIG. value 




GROUP BY BMHSIG.tid, BMHSIG. token, QMHSIG. token) MH.SIM 




GROUP BY MH.SIM.tid, MH_SIM.token2) MAXSIM, 




(SELECT R. token, R.idf 




FROM QUERY.TOKENS Q, BASE.IDF R 




WHERE Q. token = R. token 




GROUP BY Q. token 




UNION 




SELECT Q. token, A.IDFAVG AS idf 




FROM QUERY.TOKENS Q, BASE.IDFAVG A 




WHERE Q. token NOT IN (SELECT I. token FROM BASE.IDF I) 




GROUP BY Q. token) QIDF, 




BASE.TABLE R 


WHERE 


MAXSIM. token2 = I. token AND R.tid = MAXSIM.tid 


GROUP BY 


MAXSIM.tid 


HAVING 


score >= 
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B.4.3 SoftTFIDF 



Preprocessing 


INSERT INTO BASE_SIZE(size) 


SELECT 


COUNT (*) 


FROM 


BASE.TABLE 


INSERT INTO BASE.IDF (token, idf) 


SELECT 


T. token, LOG(S.size) - LOG(COUNT(DISTINCT T.tid)) 


FROM 


BASE.TOKENS T, BASE.SIZE S 


GROUP BY 


T. token 


INSERT INTO BASE_TF(tid, token, tf) 


SELECT 


T.tid, T. token, COUNT(*) 


FROM 


BASE.TOKENS T 


GROUP BY 


T.tid, T. token 


INSERT INTO BASE_LENGTH(tid, len) 


SELECT 


T.tid, SQRT(SUM(I.idf *I.idf *T.tf *T.tf )) 


FROM 


BASE.IDF I, BASE.TF T 


WHERE 


I. token = T. token 


GROUP BY 


T.tid 


INSERT INTO BASE.WEIGHTS (tid, token, weight)) 


SELECT 


T.tid, T. token, I . idf *T. tf /L . len 


FROM 


BASE.IDF I, BASE.TF T, BASE_LENGTH L 


WHERE 


I. token = T. token AND T.tid = L.tid 
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Query 

INSERT INTO SOFTTFIDF_RESULTS (tid , score) 

SELECT MAXTDKEN.tid, SUM(WB. weight * WQ. weight * MAXTOKEN .maxsim) 
FROM BASE.WEIGHTS WB, 

(SELECT JARCLSIM.tid, JARO.SIM. tokenl, JAR0_SIM.token2, MAXSIM. maxsim 
FROM (SELECT JARD_SIM.tid, JARD_SIM.token2, MAX(sim) AS maxsim 
FROM (SELECT Rl.tid AS tid, Rl. token AS tokenl , 

R2. token AS token2, JaroWinkler (Rl .token, R2 .token) AS sim 
FROM BASE.TOKENS Rl, QUERY.TOKENS R2 
WHERE JaroWinkler (Rl. token, R2. token) >= 6) JARO.SIM 
GROUP BY JARO_SIM.tid, JAR0_SIM.token2) MAXSIM, 
(SELECT Rl.tid AS tid, Rl. token AS tokenl, 

R2. token AS token2, JaroWinkler (Rl .token, R2 .token) AS sim 
FROM BASE.TOKENS Rl, QUERY.TOKENS R2 
WHERE Jar oWinkler(Rl. token, R2. token) >= 6) JARO.SIM 
WHERE JARO.SIM.tid = MAXSIM. tid AND JAR0_SIM.token2 = MAXSIM. token2 

AND MAXSIM. maxsim = JARO.SIM. sim) MAXTOKEN, 
(SELECT QTF. token, QIDF . idf *QTF.tf /QLEN. length AS weight 
FROM (SELECT R. token, R.idf 

FROM QUERY.TOKENS S, BASE.IDF R 
WHERE S. token = R. token 
GROUP BY S. token) QIDF, 
(SELECT T. token, C0UNT(*) AS tf 
FROM QUERY.TOKENS T 
GROUP BY T. token) QTF, 

(SELECT QTF. tid, SQRT (SUM (QIDF. idf *QIDF . idf *QTF . tf *QTF . tf ) ) AS length 
FROM (SELECT R. token, R.idf 

FROM QUERY.TOKENS S, BASE.IDF R 
WHERE S. token = R. token 
GROUP BY S. token) QIDF, 
(SELECT T. token, C0UNT(*) AS tf 
FROM QUERY.TOKENS T 
GROUP BY T. token) QTF 
WHERE QIDF. token = QTF. token) QLEN 
WHERE QIDF. token = T. token) WQ, 
WHERE MAXTOKEN. token2 = WQ . token AND MAXTOKEN. tid = WB.tid AND MAXTOKEN . tokenl = WB. token 

GROUP BY MAXTDKEN.tid 



